# How to Speed Up EEGUnity with Built-in Multithreading

## 1. Introduction

EEGUnity (>0.6.0) now provides built-in multithreading through the `num_workers` parameter in `UnifiedDataset`.

The new version integrates multithreading directly into the core pipeline. This ensures:

- Cleaner user code
- Safer concurrency management
- Better performance scaling
- No nested or duplicated thread pools

The design philosophy is similar to PyTorch's `DataLoader(num_workers=...)`.

------------------------------------------------------------------------

## 2. Basic Usage

To enable multithreading, simply set `num_workers` when creating a `UnifiedDataset`.

``` python
from eegunity import UnifiedDataset

u_dataset = UnifiedDataset(
    dataset_path="your_dataset_root",
    domain_tag="your_domain_tag",
    num_workers=8  # number of threads
)
```

If `num_workers=0` (default), EEGUnity runs sequentially.

If `num_workers>0`, EEGUnity internally uses a thread pool to parallelize supported operations.

No additional concurrency code is required.

------------------------------------------------------------------------

## 3. Where Multithreading Is Applied

Multithreading is automatically applied in the following stages:

### 3.1 Dataset Scanning (Parser Stage)

When `dataset_path` is provided, EEGUnity scans directories and builds the locator.

File parsing (e.g., `.fif`, `.mat`, `.csv`) is parallelized using `num_workers`.

This significantly accelerates large dataset initialization.

### 3.2 Batch Processing (EEGBatch Stage)

Functions that rely on `batch_process()` automatically inherit multithreading, including:

- `export_h5Dataset()`
- `save_as_other()`
- `process_mean_std()`
- `format_channel_names()`

Each row in the locator can be processed in parallel.

------------------------------------------------------------------------

## 4. Internal Execution Model

EEGUnity uses a `ThreadPoolExecutor` internally when `num_workers > 0`.

However, not all steps can be parallelized safely.
Some operations must remain sequential, such as:

- Writing to the same output file
- Maintaining deterministic order
- Updating shared state

EEGUnity solves this by:

- Parallelizing independent row-level tasks
- Keeping order-sensitive operations outside thread pools
- Collecting results before final writing steps

This hybrid design ensures correctness while maximizing throughput.

------------------------------------------------------------------------

## 5. Choosing num_workers

There is no single optimal value. It depends on:

- CPU core count
- Dataset size
- I/O speed (SSD vs HDD)
- Task complexity

### General Recommendations

- Start with `num_workers = number_of_CPU_cores`
- For I/O-heavy workloads, slightly higher values may help
- For CPU-heavy signal processing, stay near the core count
- For small datasets, parallelism may not provide a noticeable benefit

Example:

``` python
import os

from eegunity import UnifiedDataset

u_dataset = UnifiedDataset(
    dataset_path="your_dataset_root",
    domain_tag="your_domain_tag",
    # os.cpu_count() can return None, so fall back to a single worker
    num_workers=os.cpu_count() or 1
)
```

Always benchmark on your own system.

------------------------------------------------------------------------

## 6. Practical Example

``` python
from eegunity import UnifiedDataset

u_dataset = UnifiedDataset(
    dataset_path="your_dataset_root",
    domain_tag="your_domain_tag",
    num_workers=8
)

# Export to HDF5 in parallel
u_dataset.eeg_batch.export_h5Dataset("output_path")
```

This will:

1. Parse dataset files in parallel
2. Process each locator row concurrently
3. Safely write results in the correct order

------------------------------------------------------------------------

## 7. Important Notes

- Do not manually create external thread pools.
- Avoid nesting additional concurrency layers.
- Ensure sufficient memory is available when increasing `num_workers`.
- If debugging, temporarily set `num_workers=0` for deterministic behavior.

------------------------------------------------------------------------

## 8. Summary

The new built-in multithreading system:

- Simplifies user code
- Improves performance for large datasets
- Ensures safe parallel execution
- Requires only one parameter: `num_workers`

By delegating concurrency management to EEGUnity, users can focus on dataset processing logic instead of thread orchestration.
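
For readers curious about the hybrid model described in Section 4, the general pattern can be sketched with the Python standard library. This is only an illustration and not EEGUnity's actual implementation: `process_row`, `run_batch`, and `write_results` are hypothetical names, and the plain dictionaries stand in for locator rows.

```python
from concurrent.futures import ThreadPoolExecutor


def process_row(row):
    """Hypothetical independent row-level task (stand-in for real EEG processing)."""
    values = row["values"]
    return {"index": row["index"], "mean": sum(values) / len(values)}


def run_batch(rows, num_workers=0):
    """Process rows sequentially (num_workers=0) or in a thread pool."""
    if num_workers > 0:
        with ThreadPoolExecutor(max_workers=num_workers) as pool:
            # Executor.map returns results in input order,
            # regardless of which task finishes first.
            return list(pool.map(process_row, rows))
    return [process_row(row) for row in rows]


def write_results(results):
    """Order-sensitive step kept outside the pool: a single sequential pass."""
    return "\n".join(f"{r['index']},{r['mean']:.2f}" for r in results)


# Toy input standing in for locator rows (assumption for illustration only)
rows = [{"index": i, "values": [i, i + 1, i + 2]} for i in range(5)]

sequential = run_batch(rows, num_workers=0)
parallel = run_batch(rows, num_workers=4)
report = write_results(parallel)
```

Because results are collected in input order before the writing step runs, the sequential and threaded paths produce the same output, which is why setting `num_workers=0` is a convenient way to debug.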